Unboxing ChatGPT: A Deep-Dive on How This AI-Driven Chatbot Was Trained
In this article, we explore ChatGPT from OpenAI, including how it works, how it was trained, and the role reinforcement learning from human feedback (RLHF) plays.
ChatGPT, OpenAI's latest dialogue model, has taken the internet by storm, surpassing 1 million users in just 5 days. From seamless chatting to creating poetry and from writing code to conceiving an imaginary OS, its performance is truly mind-blowing.
How did conversational AI become so much better so quickly? OpenAI appears to have cracked the nut using Reinforcement Learning from Human Feedback (RLHF), a method that uses human demonstrations and feedback to guide the model toward desired behavior.
Why not just ask ChatGPT?

[ChatGPT's answer, delivered in verse]
Not exactly Ezra Pound but pretty impressive all the same. If you haven’t tried it yet, you can check out the research preview made available for free by OpenAI here. In this article, we'll unpack ChatGPT's training techniques and take a deeper look at what goes on under the hood.
Since OpenAI hasn't publicly revealed the fundamental details of ChatGPT's training, this piece is based on the InstructGPT paper. That's because ChatGPT is a sibling model to InstructGPT, which was also trained to follow instructions in a prompt and provide a detailed response. According to OpenAI, ChatGPT is trained "using the same methods as InstructGPT, but with slight differences in the data collection setup."
Here's what we'll be covering:
Table of Contents
The Linguistic Alignment Problem
Distilling Desired Values Through RLHF
Limitations of RLHF
Conclusion and the Future
Let's get started!
The Linguistic Alignment Problem
How are Large Language Models (LLMs) Currently Trained?
LLMs, such as generative pretrained transformers (GPTs), are transformer models trained to recognize linguistic patterns from a broad distribution of internet data. By predicting the next word in a phrase, or a masked word in a sentence, LLMs learn the statistics of word usage without human labels.
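To make this concrete, here's a minimal, toy sketch of the next-token-prediction objective in PyTorch. The tiny model and random token batch are illustrative stand-ins, not OpenAI's actual architecture or data.

```python
import torch
import torch.nn.functional as F

# Toy next-token-prediction step: the "labels" are just the text shifted by one
# position, so no human annotation is needed. All sizes are illustrative.
vocab_size, seq_len, batch_size, d_model = 1000, 16, 4, 64

embedding = torch.nn.Embedding(vocab_size, d_model)
layer = torch.nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backbone = torch.nn.TransformerEncoder(layer, num_layers=2)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for web text
causal_mask = torch.nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = backbone(embedding(tokens[:, :-1]), mask=causal_mask)  # condition on the prefix only
logits = lm_head(hidden)                                        # (batch, seq_len - 1, vocab)

# Cross-entropy between the predicted distribution and the actual next token.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
print(f"next-token prediction loss: {loss.item():.3f}")
```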
However, despite achieving high accuracy on these tasks, they fail to understand the deeper context in which a sentence is stated (unhelpful), at times produce untruthful (dishonest) or toxic (harmful) outputs, and struggle to hold a useful conversation. A less sophisticated model prompted with “how to bully someone?” might happily provide suggestions rather than refuse to respond. Averting this misalignment problem is especially important for language models deployed and used in hundreds of applications.
Why Won’t the Usual Fine-Tuning Methods Work?
Since the pre-trained models already generate reasonably coherent text, we somehow need to infuse our desired values through fine-tuning. Labeling a large demonstration dataset that captures human preferences is exceedingly expensive, and it isn't feasible to capture those preferences with a hand-crafted objective function either. However, it's relatively easy for labelers to provide feedback, such as scores or ranks, on generated model responses instead of writing high-quality dialogues themselves.
RLHF to the Rescue
We need a training technique that can efficiently explore and exploit a relatively small amount of feedback enriched with human values. Does that ring any bells? We're talking about using Reinforcement Learning (RL) so the model can learn effectively from human feedback (HF) provided as reward signals.
Distilling Desired Values Through RLHF
Bird’s Eye View of the Training Procedure and Prompt Dataset
In our context, RLHF trains an accurate reward model on human judgments and uses it to fine-tune a baseline LM agent so that it generates high-quality outputs as judged by humans. The action space of the RL agent is the vocabulary of the language itself, as it sequentially generates each word of its response.
The state space corresponds to the set of possible input token sequences (length of ~8K tokens) from which a prompt is sampled. The episode ends with the model’s response to a prompt.
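To make that framing concrete, here is a schematic sketch of a single episode. The policy, reward model, and tokenizer interfaces are hypothetical placeholders, not OpenAI's actual code.

```python
# Schematic RLHF episode: state = prompt tokens, action = next token from the
# vocabulary, episode ends when the response is complete. All objects below
# (policy, reward_model, tokenizer) are hypothetical placeholders.

def run_episode(policy, reward_model, tokenizer, prompt, max_new_tokens=256):
    state = tokenizer.encode(prompt)              # the sampled prompt defines the state
    response_tokens = []
    for _ in range(max_new_tokens):
        # The action space is the vocabulary: pick the next token given everything so far.
        action = policy.sample_next_token(state + response_tokens)
        if action == tokenizer.eos_token_id:      # end-of-sequence terminates the episode
            break
        response_tokens.append(action)
    # A single scalar reward is assigned to the full (prompt, response) pair.
    reward = reward_model.score(prompt, tokenizer.decode(response_tokens))
    return response_tokens, reward
```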
Broadly, RLHF is composed of three fundamental steps. First, a GPT-series model is fine-tuned on a demonstration dataset to produce a baseline. Next, a reward model is trained on human rankings of the baseline's responses. Finally, the trained reward model is used to fine-tune the baseline with RL techniques.
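In rough pseudocode, the three steps chain together like this (every function name here is a placeholder for the procedures described in the rest of this section):

```python
# High-level view of the three RLHF steps; all helpers are illustrative placeholders.

def rlhf_pipeline(pretrained_lm, demo_dataset, comparison_prompts, rl_prompts):
    # Step 1: supervised fine-tuning on labeler-written demonstrations.
    sft_model = supervised_fine_tune(pretrained_lm, demo_dataset)

    # Step 2: train a reward model on human rankings of the SFT model's outputs.
    comparisons = collect_human_rankings(sft_model, comparison_prompts)
    reward_model = train_reward_model(comparisons)

    # Step 3: optimize the SFT policy against the reward model with PPO.
    return ppo_fine_tune(sft_model, reward_model, rl_prompts)
```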
A diverse set of prompts, ranging from question answering to summarization, was collected and curated into distinct training and validation sets for the different RLHF steps. Some prompts were written by hired labelers, others were generated from labeler instructions asking for multiple query-response pairs, and still more came from users of the OpenAI API Playground interface.
Deep Dive Into the Three RLHF Steps
Let’s take a look at the three different steps in more detail:
1. Supervised Fine Tuning (SFT)
- Baseline: The SFT model is initialized from a GPT-3.5 series model, which is a GPT-3 model fine-tuned largely on programming code. This helps explain why ChatGPT is so good at helping users write, summarize, and debug code.
- Dataset: Consists of high-quality labeler-written demonstration responses that capture the desired values for the above-mentioned prompt set.
- Training: The demonstration dataset above is used to fine-tune the SFT model with supervised learning (a minimal code sketch follows this list).
- Evaluation: Final SFT model selection is based on step 2's reward score on the validation set rather than step 1's validation loss, because the reward turns out to be more predictive of the human preference ratings collected later, which is what matters for our use case.
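Here's a minimal sketch of what the SFT loss could look like, assuming a decoder-only sft_model and a tokenizer as placeholders; masking the prompt tokens out of the loss is a common convention rather than a confirmed detail of OpenAI's setup.

```python
import torch
import torch.nn.functional as F

# Sketch of the SFT step: plain supervised learning on (prompt, demonstration)
# pairs. `sft_model` and `tokenizer` stand in for the GPT-3.5-style baseline.

def sft_loss(sft_model, tokenizer, prompt, demonstration):
    prompt_ids = tokenizer.encode(prompt)
    demo_ids = tokenizer.encode(demonstration)
    input_ids = torch.tensor([prompt_ids + demo_ids])

    logits = sft_model(input_ids)             # (1, seq_len, vocab_size)
    preds = logits[:, :-1, :]                 # each position predicts the next token
    targets = input_ids[:, 1:].clone()

    # Assumption: only the demonstration tokens contribute to the loss,
    # so the model isn't trained to "predict the prompt".
    targets[:, : len(prompt_ids) - 1] = -100  # ignore_index below skips these
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)),
                           targets.reshape(-1), ignore_index=-100)
```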

2. Reward Model (RM) training
- Baseline: The reward model is initialized from a 6B GPT-3 model that has been fine-tuned on some public NLP datasets. The rationale seems to be that the RM needs to be roughly as capable as the responding agent in order to evaluate it effectively.
- Dataset: Consists of labeler rankings of K (ranging from 4 to 9) model outputs, obtained from different seeds or instances of the model for the same input prompt. The key idea behind step 2 is that this comparison dataset is easier and cheaper to obtain than SFT's demonstration dataset, since ranking model responses is far less work for labelers than writing “aligned” responses from scratch. The ranking procedure also yields C(K, 2) comparisons per prompt, so the number of ranked pairs used for training is an order of magnitude larger than the number of prompts.
- Training: The RM takes in a prompt, x, and a response, y, from the SFT model and outputs a scalar reward, r_θ(x, y), through a final projection layer. For each prompt, all C(K, 2) comparisons are placed in a single batch, which improves both computational efficiency (fewer forward passes) and validation accuracy (training on the correlated comparisons across separate batches causes the model to overfit). The RM's loss function is:

loss(θ) = −1/C(K, 2) · E_{(x, y_w, y_l) ∼ D} [ log( σ( r_θ(x, y_w) − r_θ(x, y_l) ) ) ]
where θ denotes the RM's parameters and y_w is the response preferred over y_l according to the labeler rankings. The difference inside the sigmoid represents the log odds that one response will be preferred over the other by a human labeler. Since the RM loss is invariant to constant shifts in reward (it only depends on the difference of two reward values), the RM is normalized with a bias so that the labeler demonstrations achieve a mean score of 0 before RL begins. (A code sketch of this pairwise loss follows the list.)
- Evaluation: Although a larger 175B RM achieved lower validation loss, the 6B model was chosen because its training was much more stable and compute-efficient.
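As a sketch of that pairwise loss, here is one way the C(K, 2) comparisons for a single prompt could be scored; reward_model is a placeholder that returns a scalar tensor r_θ(x, y).

```python
import torch
import torch.nn.functional as F
from itertools import combinations

# Sketch of the reward-model loss described above. For one prompt, every
# response is scored once, then all C(K, 2) ordered pairs share those scores.
# `reward_model(prompt, response)` is a placeholder returning a scalar tensor.

def rm_loss(reward_model, prompt, responses_ranked_best_to_worst):
    rewards = torch.stack([reward_model(prompt, y)
                           for y in responses_ranked_best_to_worst])

    # For every pair, the higher-ranked response should receive the higher reward:
    # per-pair loss = -log sigmoid( r(x, y_w) - r(x, y_l) )
    pair_losses = [-F.logsigmoid(rewards[w] - rewards[l])
                   for w, l in combinations(range(len(rewards)), 2)]  # w ranked above l

    # Averaging gives the 1 / C(K, 2) scaling; a bias can later be added to the
    # trained RM so labeler demonstrations score 0 on average, as noted above.
    return torch.stack(pair_losses).mean()
```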

3. PPO training & evaluation
- Baseline: SFT model as the RL agent/policy.
- Dataset: Consists of the input prompt set without any labels, along with some of GPT-3's pre-training data to combat the “alignment tax” (discussed below).
- Training: The RL environment presents a prompt to the agent and expects a response to that prompt as the action. Given the prompt and response, the RM produces a reward that is used to train the agent with PPO, and the episode ends. A KL penalty term penalizes the PPO model for over-optimizing against the RM and drifting too far from the SFT model, which occasionally leads to gibberish output; it also encourages the policy to keep exploring rather than collapse into a single mode. PPO's objective function is (a code sketch of this objective follows the list):

objective(φ) = E_{(x, y) ∼ D_{π_φ^RL}} [ r_θ(x, y) − β·log( π_φ^RL(y | x) / π^SFT(y | x) ) ] + γ·E_{x ∼ D_pretrain} [ log( π_φ^RL(x) ) ]
- PPO details: Proximal Policy Optimization (PPO) is an “on-policy” algorithm that uses a clipped surrogate objective for policy updates, so that the updated policy stays within a certain distance of the previous policy. The advantage is the difference between the return (reward) actually received for an action and the expected return of being in that state; the latter is estimated with a separate value function, which is initialized from the trained RM and updated alongside the policy.
- Engineering hack: An “alignment tax” was also observed: the alignment procedure comes at the cost of a slight performance degradation (compared to the baseline) on the public NLP datasets that we care about. To fix this performance regression, mixing gradients from the pre-training objective into the PPO updates was found to be beneficial.
- Final evaluation: Steps 2 and 3 were iterated continuously: more comparison data was collected on the current best policy, which was used to train a new RM and then a new policy. The final PPO model generalizes better than other LMs at adhering to preferences like “following instructions,” “truthfulness,” and “less toxicity” on public NLP datasets, even in settings it was never explicitly supervised on!
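Putting the PPO-related pieces together, here is a sketch of the per-sample quantity being maximized, assuming the log-probabilities have already been computed; β and γ are illustrative coefficients rather than OpenAI's tuned values, and the actual policy update then applies PPO's clipped surrogate objective to this signal.

```python
# Sketch of the per-sample RLHF objective described above: RM reward, minus a
# KL penalty that keeps the policy close to the SFT model, plus a pre-training
# log-likelihood term that limits the "alignment tax". All inputs are floats
# computed elsewhere; beta and gamma here are illustrative placeholders.

def rlhf_objective(rm_score, policy_logprob_y, sft_logprob_y,
                   policy_logprob_pretrain, beta=0.02, gamma=1.0):
    # rm_score:                r_theta(x, y) for the sampled response y
    # policy_logprob_y:        log pi_RL(y | x) under the current policy
    # sft_logprob_y:           log pi_SFT(y | x) under the frozen SFT model
    # policy_logprob_pretrain: log-likelihood of a pre-training sample under pi_RL
    kl_penalty = beta * (policy_logprob_y - sft_logprob_y)  # discourages drifting from SFT
    pretraining_term = gamma * policy_logprob_pretrain      # mixes in pre-training gradients
    return rm_score - kl_penalty + pretraining_term         # maximized via PPO updates
```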

That's it! Those are the key steps it took to bring ChatGPT to life! The crucial takeaway is that the cost of increasing model alignment through RLHF is modest relative to fine-tuning with a massive demonstration dataset.
However, this is not to say that this approach is a silver bullet, as ChatGPT has shortcomings.
Limitations of RLHF
There are two main limitations here:
- The model is not aligned with “human” preferences but with those of labelers, OpenAI developers, and customers.
- It remains an open question whether the improvements we're seeing are actually due to RLHF or simply due to the newly available ranking dataset.
An adversarial set-up where labelers or users find the worst-case behaviors of the model, which are then labeled and added to the training dataset, can be used to curtail these limitations to an extent.
Conclusion and the Future
Despite its limitations, ChatGPT continues to amaze its users and can be used as-is or fine-tuned for a variety of business applications. We can certainly expect many more improvements built on its shoulders! There is also a great deal of excitement and anticipation about when GPT-4 will be launched.
If a billion-parameter ChatGPT is so good, can you imagine how good a trillion-parameter model will be? Will RLHF become the new norm for model alignment? Chances are, we'll have our answer sooner rather than later.